library(Stat2Data)
library(regplanely)
library(ggplot2)
library(performance)
data("Diamonds")
head(Diamonds)
## Carat Color Clarity Depth PricePerCt TotalPrice
## 1 1.08 E VS1 68.6 6693.3 7228.8
## 2 0.31 F VVS1 61.9 3159.0 979.3
## 3 0.31 H VS1 62.1 1755.0 544.1
## 4 0.32 F VVS1 60.8 3159.0 1010.9
## 5 0.33 D IF 60.8 4758.8 1570.4
## 6 0.33 G VVS1 61.5 2895.8 955.6
ggplot(data = Diamonds, mapping = aes(y = TotalPrice,
x = Carat)) +
geom_point() +
# geom_smooth(method = lm, formula = y ~ x) +
theme_bw()
model = lm(TotalPrice ~ Carat * PricePerCt, data = Diamonds)
summary(model)
##
## Call:
## lm(formula = TotalPrice ~ Carat * PricePerCt, data = Diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.112023 -0.025374 0.002803 0.026619 0.080538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.096e-02 8.836e-03 -1.240e+00 0.2158
## Carat 6.831e-03 1.049e-02 6.510e-01 0.5153
## PricePerCt 2.936e-06 1.629e-06 1.802e+00 0.0724 .
## Carat:PricePerCt 1.000e+00 8.663e-07 1.154e+06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03583 on 347 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.501e+12 on 3 and 347 DF, p-value: < 2.2e-16
check_model(model, check = c("qq", "normality", "linearity", "homogeneity"))
While there are some mild violations of the model assumptions, especially as the fitted values increase in magnitude, the model is mostly sound.
regression_plane(model)
This model has an adjusted \(R^2\) value of 1, which means essentially all of the variation is explained by our model. Because of the interaction term, the effect of Carat depends on PricePerCt: for each one-unit increase in Carat, the predicted TotalPrice increases by 6.831e-03 + 1.000 × PricePerCt.
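With an interaction term there is no single "Carat slope" – it has to be read off at a chosen PricePerCt. A minimal sketch of that computation, using simulated stand-in data (not the Stat2Data Diamonds) so it runs on its own; the real calculation would use the fitted `model` above:

```r
## Simulated stand-in data: total price is roughly carat times price-per-carat
set.seed(1)
d <- data.frame(Carat = runif(100, 0.2, 2),
                PricePerCt = runif(100, 1000, 7000))
d$TotalPrice <- d$Carat * d$PricePerCt + rnorm(100, sd = 1)
m <- lm(TotalPrice ~ Carat * PricePerCt, data = d)

## with an interaction, the Carat slope depends on the PricePerCt value
carat_slope_at <- function(model, ppc) {
  unname(coef(model)["Carat"] + coef(model)["Carat:PricePerCt"] * ppc)
}
carat_slope_at(m, 3000)  ## close to 3000 by construction of the fake data
```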
library(tidyverse)
library(ggplot2)
library(dplyr)
lung_volume = read.csv("https://bit.ly/lung_volume")
head(lung_volume)
## age fev ht sex smoke
## 1 9 1.708 57.0 F No
## 2 8 1.724 67.5 F No
## 3 7 1.720 54.5 F No
## 4 9 1.558 53.0 M No
## 5 9 1.895 57.0 M No
## 6 8 2.336 61.0 F No
ggplot(data = lung_volume, aes(x = smoke, y = fev)) +
geom_boxplot() +
theme_bw()
Visualize a regression model that estimates how smoking affects FEV, taking into account a person's age. Your model should allow the effect of smoking to depend on a person's age.
ggplot(data = lung_volume, mapping = aes(y = fev,
color = smoke,
x = age)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = F) +
theme_bw()
How does age affect lung volume among non-smokers? How does age affect lung volume among smokers?
smoking_model = lm(fev ~ age * smoke, data = lung_volume)
summary(smoking_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2533955 0.08265075 3.065859 2.260474e-03
## age 0.2425584 0.00833154 29.113274 6.504859e-120
## smokeYes 1.9435707 0.41428463 4.691390 3.309624e-06
## age:smokeYes -0.1627027 0.03073753 -5.293290 1.645078e-07
age:smokeYes is the change in slope for smokers: for each one-year increase in age among smokers, fev increases by -0.163 + 0.243 = 0.080. age is the slope for the baseline group, i.e. non-smokers: for each one-year increase in age, fev increases by 0.243.
Do you believe that smoking changes the relationship between age and lung volume? Support your answer using information from the regression table
Yes – this model indicates that for people who smoke, lung volume grows much more slowly with age (a slope of 0.080 per year), whereas for non-smokers, lung volume increases considerably faster (0.243 per year). The p-values in this table indicate that the difference in slope between non-smokers and smokers is significant (age:smokeYes p-value = 1.645e-07).
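One way to sanity-check the 0.243 - 0.163 = 0.080 arithmetic: in a fully interacted model, the smokers' slope (baseline slope plus interaction coefficient) matches what you get by fitting the smokers on their own. A sketch with simulated stand-in data (not the lung_volume data):

```r
## fake data where smokers' fev grows more slowly with age
set.seed(2)
n <- 200
smoke <- rep(c("No", "Yes"), each = n / 2)
age <- runif(n, 5, 18)
fev <- ifelse(smoke == "Yes", 2.0 + 0.08 * age, 0.25 + 0.24 * age) +
  rnorm(n, sd = 0.1)
d <- data.frame(age, smoke, fev)
m <- lm(fev ~ age * smoke, data = d)

## slope for smokers = baseline slope + interaction coefficient ...
smoker_slope <- unname(coef(m)["age"] + coef(m)["age:smokeYes"])
## ... which is exactly the slope from fitting the smokers alone
sub_slope <- unname(coef(lm(fev ~ age, data = subset(d, smoke == "Yes")))["age"])
```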
Is there a difference in lung volume between an averaged age person who smokes and an average aged person who doesn’t smoke?
The existing model compares smokers and non-smokers at age 0, but we can mean-center the age variable to compare lung volumes where \(age - \bar{x}_{age} = 0\), which is when age = average age.
ggplot(data = lung_volume, aes(x = age - mean(age),
y = fev,
color = smoke)) +
geom_point() +
geom_smooth(se = F, method = lm, formula = y ~ x) +
theme_bw()
lung_volume = mutate(lung_volume,
mc_age = age - mean(age))
mc_lung_model = lm(fev ~ mc_age * smoke, data = lung_volume)
summary(mc_lung_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6622898 0.02305217 115.489795 0.000000e+00
## mc_age 0.2425584 0.00833154 29.113274 6.504859e-120
## smokeYes 0.3277391 0.12861470 2.548225 1.105618e-02
## mc_age:smokeYes -0.1627027 0.03073753 -5.293290 1.645078e-07
This new model tells us that at the average age, a smoker's predicted fev is 0.327 higher than a non-smoker's (smokeYes is now the gap between the groups at the average age, not a change in slope). We can tell that difference is significant because the p-value for that term is 0.011 < 0.05.
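Mean-centering only re-expresses the same model: the slopes are untouched, and the smokeYes term becomes the group gap at the average age. A quick check with simulated stand-in data (not the lung_volume data):

```r
## fake fev data with an age-by-smoking interaction
set.seed(8)
age <- runif(100, 5, 18)
smoke <- rep(c("No", "Yes"), 50)
fev <- 0.25 + 0.24 * age + ifelse(smoke == "Yes", 1.9 - 0.16 * age, 0) +
  rnorm(100, sd = 0.2)
d <- data.frame(age, smoke, fev, mc_age = age - mean(age))

raw <- lm(fev ~ age * smoke, data = d)
mc  <- lm(fev ~ mc_age * smoke, data = d)

## slopes are identical; only the intercept and smokeYes terms change
coef(raw)[c("age", "age:smokeYes")]
coef(mc)[c("mc_age", "mc_age:smokeYes")]
```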
library(tidyverse)
library(moderndive)
library(performance)
nyc_airport_weather = read.csv("https://bit.ly/nyc_airport_weather")
nyc_airport_weather = nyc_airport_weather %>%
filter(!is.na(avg_temp))
head(nyc_airport_weather)
## month airport avg_temp avg_humid
## 1 1 EWR 35.56216 62.12451
## 2 1 JFK 35.38555 61.66162
## 3 1 LGA 35.95927 59.15609
## 4 2 EWR 34.26332 63.33558
## 5 2 JFK 34.19246 62.55848
## 6 2 LGA 34.35612 60.68603
Visualize a parallel slopes model that uses temperature and airport to predict the humidity level
ggplot(data = nyc_airport_weather, mapping = aes(y = avg_humid,
x = avg_temp,
color = airport)) +
geom_point() +
geom_parallel_slopes(se = F) +
theme_bw()
Fit the same parallel slopes model you visualized and report the regression table
airport_model = lm(avg_humid ~ avg_temp + airport, data = nyc_airport_weather)
summary(airport_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.9468963 2.74595125 20.374322 1.648654e-19
## avg_temp 0.1280757 0.04549216 2.815337 8.394683e-03
## airportJFK 2.2716649 1.74601374 1.301058 2.028265e-01
## airportLGA -3.7458961 1.74800212 -2.142959 4.007585e-02
check_model(airport_model, check = c("qq", "normality", "linearity", "homogeneity"))
From this check, we see some mild assumption violations in the normality-of-residuals checks, but otherwise this model is mostly appropriate.
Match each of the labeled line segments in the figure below to the coefficient in the equation that it best represents, and also note the coefficient's fitted value from the regression table.
What is the estimated humidity at Newark Liberty International Airport when the average monthly temperature is zero degrees?
new_data = data.frame(avg_temp = 0, airport = "EWR")
predict(airport_model, new_data)
## 1
## 55.9469
Imagine that a flight that was scheduled to land at Newark Liberty International Airport got diverted to LaGuardia Airport, and the temperature was 64 degrees. Would it be more humid or less humid at LaGuardia than at Newark when the flight lands? By how much?
predict(airport_model, data.frame(avg_temp = 64, airport = "EWR"))
## 1
## 64.14374
predict(airport_model, data.frame(avg_temp = 64, airport = "LGA"))
## 1
## 60.39785
Based on this model, we can predict that it would be less humid at LGA than at EWR when the plane lands – by 64.14 - 60.40 ≈ 3.75 percentage points, which is exactly the airportLGA coefficient, since the slopes are parallel – given that the temperature was 64 at both.
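In a parallel-slopes model, the prediction gap between two groups at the same temperature is always the group coefficient, whatever the temperature. A sketch with simulated stand-in data (not the NYC airport weather data):

```r
## fake parallel-slopes data: humidity rises with temp, LGA sits lower
set.seed(4)
d <- data.frame(temp = runif(60, 30, 80),
                airport = rep(c("EWR", "LGA"), each = 30))
d$humid <- 56 + 0.13 * d$temp + ifelse(d$airport == "LGA", -3.7, 0) + rnorm(60)
m <- lm(humid ~ temp + airport, data = d)

## gap between the two airports at the same temperature
gap <- unname(predict(m, data.frame(temp = 64, airport = "EWR")) -
              predict(m, data.frame(temp = 64, airport = "LGA")))
gap  ## equals -coef(m)["airportLGA"], at 64 degrees or any other temperature
```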
library(Stat2Data)
library(ggplot2)
data("BaseballTimes2017")
head(BaseballTimes2017)
## Game League Runs Margin Pitchers Attendance Time
## 1 CHC-ARI NL 11 5 10 39131 203
## 2 KCR-CHW AL 9 3 7 18137 169
## 3 MIN-DET AL 13 5 10 29733 201
## 4 SDP-LAD NL 7 1 6 52898 179
## 5 COL-MIA NL 9 3 10 20096 204
## 6 CIN-MIL NL 21 1 10 34517 235
pitchers_model = lm(Time ~ Pitchers, data = BaseballTimes2017)
summary(pitchers_model)
##
## Call:
## lm(formula = Time ~ Pitchers, data = BaseballTimes2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.21 -11.19 -5.25 5.06 31.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.069 23.412 5.299 0.000189 ***
## Pitchers 8.017 2.722 2.946 0.012239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.52 on 12 degrees of freedom
## Multiple R-squared: 0.4197, Adjusted R-squared: 0.3713
## F-statistic: 8.678 on 1 and 12 DF, p-value: 0.01224
ggplot(data = BaseballTimes2017, mapping = aes(x = Pitchers,
y = Time)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x) +
theme_bw()
Extend the simple linear regression model above by including the total number of runs scored as an explanatory variable
pitcher_run_model = lm(Time ~ Pitchers + Runs, data = BaseballTimes2017)
summary(pitcher_run_model)
##
## Call:
## lm(formula = Time ~ Pitchers + Runs, data = BaseballTimes2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.789 -10.682 -2.683 5.260 34.211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.135 21.410 6.265 6.13e-05 ***
## Pitchers 2.787 3.528 0.790 0.4462
## Runs 3.262 1.600 2.039 0.0663 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 11 degrees of freedom
## Multiple R-squared: 0.5788, Adjusted R-squared: 0.5022
## F-statistic: 7.558 on 2 and 11 DF, p-value: 0.008605
How should you interpret the Runs coefficient in the
context of these data?
The Runs coefficient tells us that, based on this
sample, the model predicts that for each one-run increase,
holding the Pitchers value constant at
any given value, the game time will increase by 3.262 minutes.
Is there still good evidence for a linear relationship between the number of pitchers and game duration, even after taking into account the number of runs scored in a game?
No. Once Runs is taken into account, the evidence for Pitchers weakens considerably: its p-value rises to 0.446, while Runs' p-value (0.066) is much lower, as the number of runs explains game duration better than the number of pitchers does.
Why does the effect of Pitchers appear smaller, once you account for the number of runs scored?
This is a multiple linear regression model, and Pitchers and Runs are themselves related (games with more runs tend to use more pitchers). In the simple model, Pitchers absorbed some of the effect that really belongs to Runs; once Runs enters the model, each coefficient reflects that variable's effect with the other held constant, so the Pitchers coefficient shrinks.
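That shrinkage can be seen directly in a small simulation. A sketch under assumed (simulated) data – not the BaseballTimes2017 data – where game time truly depends only on runs, but runs and pitchers are correlated:

```r
## fake data: pitchers and runs are correlated; time depends only on runs
set.seed(5)
pitchers <- rpois(100, 8)
runs <- pitchers + rpois(100, 4)   ## correlated with pitchers by construction
time <- 120 + 4 * runs + rnorm(100, sd = 10)

## alone, pitchers proxies for runs and picks up a large slope;
## alongside runs, its own coefficient shrinks toward zero
simple   <- unname(coef(lm(time ~ pitchers))["pitchers"])
multiple <- unname(coef(lm(time ~ pitchers + runs))["pitchers"])
c(simple = simple, multiple = multiple)
```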
library(Stat2Data)
library(tidyverse)
data("MothEggs")
moth_model = lm(Eggs ~ BodyMass, data = MothEggs)
summary(moth_model)
##
## Call:
## lm(formula = Eggs ~ BodyMass, data = MothEggs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157.586 -17.187 3.162 25.790 67.960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.38 45.38 0.537 0.59423
## BodyMass 79.86 26.69 2.992 0.00492 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.75 on 37 degrees of freedom
## Multiple R-squared: 0.1948, Adjusted R-squared: 0.173
## F-statistic: 8.95 on 1 and 37 DF, p-value: 0.004916
ggplot(data = MothEggs, mapping = aes(x = BodyMass, y = Eggs)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x) +
scale_x_continuous(breaks = seq(1.2, 2.3, by = 0.1)) +
theme_bw()
slope_coeff = coef(moth_model)["BodyMass"]
## the standard error of the slope comes from the regression table,
## not from the standard deviation of the predictor
std_error = summary(moth_model)$coefficients["BodyMass", "Std. Error"]
alpha = 0.05
df = nrow(MothEggs) - 2
t_crit = qt(1 - alpha/2, df)
margin_error = t_crit * std_error
lower = slope_coeff - margin_error
upper = slope_coeff + margin_error
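The hand-built t interval above should reproduce what confint() computes for an lm fit. A cross-check with simulated stand-in data (the real check would use moth_model):

```r
## fake data shaped like the moth example: slope near 80, noisy response
set.seed(6)
x <- runif(40, 1.2, 2.3)
y <- 24 + 80 * x + rnorm(40, sd = 45)
m <- lm(y ~ x)

est <- summary(m)$coefficients["x", "Estimate"]
se  <- summary(m)$coefficients["x", "Std. Error"]
manual <- est + c(-1, 1) * qt(1 - 0.05 / 2, df = 40 - 2) * se
auto <- confint(m, "x", level = 0.95)  ## same interval
```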
For a moth with a body mass of 1.3 grams, the plausible range of egg counts (reading the interval off the plot at that mass) is about 100 to 150.
If an 84% confidence interval were constructed around the fitted slope coefficient from each sample, I would predict that 0.84 × 200 = 168 of the intervals would capture the true slope of the Eggs ~ BodyMass regression model.
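The coverage claim can be checked by simulation: draw 200 samples from a known model, build an 84% interval for the slope from each, and count how many capture the true slope. A sketch under assumed simulated data:

```r
## 200 simulated samples from a model with known slope 1.5
set.seed(3)
true_slope <- 1.5
captured <- replicate(200, {
  x <- rnorm(20)
  y <- 2 + true_slope * x + rnorm(20)
  ci <- confint(lm(y ~ x), "x", level = 0.84)
  ci[1] <= true_slope && true_slope <= ci[2]
})
sum(captured)  ## should land near 0.84 * 200 = 168
```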
library(tidyverse)
set.seed(1)
n = 20
# simulated_data = data.frame(dataset_id = rep(1:10000, each = n),
# x = rnorm(n, mean = 10, sd = 5)) %>%
# mutate(y = 3 + rnorm(n*10000, mean = 0, sd = 2))
# extract_slope_and_std_error = function(model) {
# reg_table = summary(model)$coefficients
# stats = data.frame(Slope = reg_table[2, "Estimate"],
# Std_Error = reg_table[2, "Std. Error"])
# return(stats)
# }
#
# ## generate slopes from simulated data grouped by id number, then summarize w/ slope and std error
# slopes = simulated_data %>%
# group_by(dataset_id) %>%
# summarise(extract_slope_and_std_error(lm(y ~ x)))
#
# head(slopes)
## t-value is the slope/std-error value
# slopes = slopes %>%
# mutate(t = Slope / Std_Error)
#
# head(slopes)
visualize the t-statistic distribution
# ## data is slopes, put on the x-axis the t-statistics
# ggplot(slopes, aes(x = t)) +
# ## make a histogram with 40 bins that shows density
# geom_histogram(mapping = aes(y = after_stat(density)),
# bins = 40) +
# ## do colors
# ## each simulated regression has n = 20 points, so slope t-statistics have n - 2 df
# geom_function(color = "blue", fun = function(x) dt(x, df = n - 2)) +
# ## set the bounds for the x axis
# scale_x_continuous("t statistics for slope", limits = c(-5, 5))
The qt() function returns quantiles of a distribution.
The t distribution is symmetric, which is why the two quantiles
below are the +- versions of the same value.
## 1% of results fall below this value (99% above it)
qt(0.01, df = 10000-2)
## [1] -2.326721
## 99% of results fall below this value (only 1% above it)
qt(0.99, df = 10000-2)
## [1] 2.326721
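Because the t distribution is symmetric about zero, those two tail quantiles are sign-flipped copies of each other, which can be confirmed directly:

```r
## symmetric t distribution: lower and upper tail quantiles mirror each other
all.equal(qt(0.01, df = 10000 - 2), -qt(0.99, df = 10000 - 2))  ## TRUE
```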
We break the normality assumption by drawing the errors from an exponential distribution instead of a normal one.
set.seed(1)
# bad_data = data.frame(dataset_id = rep(1:10000, each = n),
# x = rnorm(n, mean = 10, sd = 5)) %>%
# mutate(y = 3 + rexp(n*10000, rate = 3))
#
# head(bad_data)
# bad_model_slopes = bad_data %>%
# group_by(dataset_id) %>%
# summarize(extract_slope_and_std_error(lm(y ~ x))) %>%
# mutate(t = Slope / Std_Error)
t-statistics for slopes from the exponential-error data
# ggplot(data = bad_model_slopes, mapping = aes(x = t)) +
# geom_histogram(aes(y = after_stat(density)),
# bins = 40, fill = "red", alpha = 0.5) +
# geom_function(color = "blue", fun = function(x) dt(x, df = n - 2)) +
# scale_x_continuous("t statistics for slope", limits = c(-5, 5))
I would have expected this distribution not to be unimodal and mostly symmetric – rather, because the errors are exponential, I would have expected the majority of values to be farther from the center, so more of a bowl shape than a hill.
This value has 99% of all values in the distribution above it.
# bad_model_slopes %>%
# summarize(q = quantile(t, 0.01))
This value has only 1% of the distribution above it.
It is closer to zero (in absolute value) than the 0.01 quantile, implying the distribution isn't symmetric.
# bad_model_slopes %>%
# summarize(q = quantile(t, 0.99))
The t distribution should be fully symmetric, but this one is not, meaning two-sided test inferences based on it will be inaccurate.
library(infer)
library(ggplot2)
NCbirths = read.csv("https://bit.ly/ncbirths")
head(NCbirths)
## fage mage mature weeks term visits gained weight sex smoker
## 1 22 20 younger mom 32 premie 5 40 2.69 male 1
## 2 32 32 younger mom 38 full term 10 42 8.88 male 0
## 3 34 33 younger mom 38 full term 18 6 7.06 female 0
## 4 NA 18 younger mom 42 full term 15 27 7.44 male 0
## 5 NA 22 younger mom 40 full term 18 54 6.38 male 1
## 6 43 39 mature mom 38 full term 17 40 7.50 female 0
We are looking at the relationship between the age of the mother and the weight of the baby.
plot = ggplot(data = NCbirths, mapping = aes(x = mage, y = weight)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
theme_bw()
plot
The observed slope of the actual data:
observed_slope = NCbirths %>%
specify(weight ~ mage) %>%
calculate(stat = "slope")
Generate the null distribution using permutations:
set.seed(54321)
permutation_dist = NCbirths %>%
specify(weight ~ mage) %>% ## variables and relationship?
hypothesize(null = "independence") %>% ## set the null hypothesis there's no relationship
generate(reps = 1000, type = "permute") %>% ## generate the random distributions
calculate(stat = "slope") ## calculate the slope of each distribution
Visualize the permutation distribution:
visualize(permutation_dist) +
shade_p_value(obs_stat = observed_slope, direction = "both")
## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?
generate a p-value
p_value = permutation_dist %>%
get_p_value(obs_stat = observed_slope, direction = "both")
p_value
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.072
The slope of the line in the first figure is the observed slope
of the original data, aka observed_slope:
## Response: weight (numeric)
## Explanatory: mage (numeric)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.0134
Using the visualize() function displays a histogram. Ignoring any red shaded regions, what does this histogram itself represent?
This visualization shows the distribution of slope values across the 1000 random permutations of the dataset: the y-axis counts the permutations whose slope falls in each bin along the x-axis.
What are the chances of observing a slope as extreme or more extreme as the one observed in these data, if the mother's age and the baby's birth weight are truly unrelated to one another?
If the mother's age truly has no impact on the weight of the baby, the
chance of observing a slope as extreme as the one observed in the data
is the p-value of the observed slope in the permutation distribution,
aka p_value:
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.072
According to this test, the p-value (0.072) is above 0.05, so we do not have statistically significant evidence that the mother's age is associated with the baby's weight.
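The infer pipeline above can be sketched in base R – shuffle the response to break any real association, refit the model, and count how often a shuffled slope is at least as extreme as the observed one. Simulated stand-in data (not the NCbirths data):

```r
## fake data with a very weak relationship between x and y
set.seed(54321)
x <- rnorm(200, mean = 27, sd = 6)
y <- 7 + 0.01 * x + rnorm(200)

## observed slope from the unshuffled data
obs <- unname(coef(lm(y ~ x))["x"])

## 1000 permutations: shuffling y breaks any x-y association
perm <- replicate(1000, unname(coef(lm(sample(y) ~ x))["x"]))

## two-sided p-value: share of permuted slopes at least as extreme
p_value <- mean(abs(perm) >= abs(obs))
```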
library(Stat2Data)
library(ggplot2)
library(broom)
library(equatiomatic)
## load data
data("BaseballTimes2017")
head(BaseballTimes2017)
## Game League Runs Margin Pitchers Attendance Time
## 1 CHC-ARI NL 11 5 10 39131 203
## 2 KCR-CHW AL 9 3 7 18137 169
## 3 MIN-DET AL 13 5 10 29733 201
## 4 SDP-LAD NL 7 1 6 52898 179
## 5 COL-MIA NL 9 3 10 20096 204
## 6 CIN-MIL NL 21 1 10 34517 235
plot = ggplot(BaseballTimes2017, mapping = aes(x = Pitchers,
y = Time)) +
geom_point() +
theme_bw() +
geom_smooth(method = lm, se = FALSE, formula = y ~ x)
plot
a: Fit the regression model visualized above
baseball_model = lm(Time ~ Pitchers, BaseballTimes2017)
b: Print the model summary
summary(baseball_model)
##
## Call:
## lm(formula = Time ~ Pitchers, data = BaseballTimes2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.21 -11.19 -5.25 5.06 31.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.069 23.412 5.299 0.000189 ***
## Pitchers 8.017 2.722 2.946 0.012239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.52 on 12 degrees of freedom
## Multiple R-squared: 0.4197, Adjusted R-squared: 0.3713
## F-statistic: 8.678 on 1 and 12 DF, p-value: 0.01224
c: Interpret the model’s slope coefficient in a sentence.
This model indicates that for each additional pitcher used during a game, the game tends to last about 8.017 minutes longer.
Use the fitted model's equation to compute the residual error for
Game #10, between the Los Angeles Angels and the Seattle Mariners, a
game where 9 pitchers were used (feel free to use pen and paper when
working with the equation). Finally, check your work by using the
augment() function from the broom package to
inspect the precise residual error.
isolate the coefficients
baseball_coefs = coef(baseball_model)
baseball_coefs
## (Intercept) Pitchers
## 124.068966 8.017241
generate the estimated time
## Game 10 predicted time = intercept + slope * # of pitchers
g10_time_est = 124.068966 + 8.017241*9
g10_time_est
## [1] 196.2241
actual time:
## access baseball times at row 10, column 7 (game 10, time)
g10_time = BaseballTimes2017[10, 7]
g10_time
## [1] 184
calculate the residual by subtracting the estimate from the actual value:
g10_resid = g10_time - g10_time_est
g10_resid
## [1] -12.22413
calculate the residuals for all rows from the fitted model
residuals = augment(baseball_model)
isolate game 10's actual residual
residuals[10, ".resid"]
## # A tibble: 1 × 1
## .resid
## <dbl>
## 1 -12.2
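The hand computation generalizes: a residual is always the observed value minus the fitted value. A sketch with simulated stand-in data (not the baseball data, which would require Stat2Data):

```r
## fake game data shaped like the baseball example
set.seed(7)
x <- rpois(14, 8)                       ## pitchers per game
y <- 124 + 8 * x + rnorm(14, sd = 17)   ## game duration in minutes
m <- lm(y ~ x)

## residual for the 10th game, by hand: observed minus fitted ...
by_hand <- y[10] - unname(coef(m)[1] + coef(m)[2] * x[10])
## ... which matches the residual the model stores
stored <- unname(resid(m)[10])
```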